scores are investigated in light of their usefulness. The history of
the grade equivalent scores used in standardized tests and in
readability scores can be perceived as reflecting a circular, or
skyhook relationship between these scores and curricular material. A
testing procedure consisting of items beginning at a basal level,
where a student was convinced of his or her mastery, and continuing
until a ceiling of error is reached can eliminate the inattention
factor and the guess factor. In addition, such a system can preserve
the efficiency of the group format and make test results more
interpretable to the specialist, as well as more understandable to
the child. (KS)
Grade Level Expectations and Grade Equivalent Scores
in Reading Tests

In 1924 E. A. Lincoln declared: "It is probable
that we shall never know the real beginning of the use of
standard tests" (p. 12). The movement began perhaps as
early as 1575, according to Linden & Linden (1968, p. 2),
and included, much later, Rev. Fisher's and Dr. Rice's
desires for scaled measures of school children's progress
year by year. Tests were needed in the early "scientific"
days of the twentieth century, claimed Monroe (1917, p. 71),
because of the inconsistency and inaccuracy of teacher
marks or grades. Certainly this desire for more accurate
and consistent quantification of children's performances
has not lessened. The field of reading research has been
filled with test constructors and test interpreters
attempting to reach this accuracy and consistency to guide
the improvement of children's reading.

Early test constructors, like Simon and Binet,
wished to devise scales that increased gradually in
difficulty so that children's performances could be ranged
on a continuum. In mental measurement, Simon and Binet
thought that they could arrange tasks by levels appropriate
to one age group and not to a younger age group. They
used 67-75 percent accuracy as their criterion for such a
stepwise age level discrimination (Linden & Linden, p. 17).
Reading test experimenters were led in a similar fashion
to devise reading scales, first oral, then silent. Thorndike
and others invented reading scales using passages graduated
in difficulty so that a reading age or reading grade might
be established for a child. A colleague of Thorndike,
W. A. McCall, was so enamored with the idea of accurate
and consistent measurement on a graded basis that he
surveyed entire school populations and advised the promotion
and demotion of children until their average grade scores
(using conversion to his standard T score) for the entire
academic curriculum (the testable part, that is) fit his
range of expectations (McCall, 1927, p. 67ff.).

Statistical manipulation and test writing became
more sophisticated, but still the idea clung in the minds
of educators and test experts that certain reading tasks
or levels are appropriate at one age level that will be
too difficult at another. The problem has remained:
a clearcut determination of grade levels for reading
materials and reading tasks has not yet been defined.

This paper will investigate three uses for grade
and age equivalent scores: 1) scores on silent reading tests;
2) readability scores attached to children's reading
matter including textbooks; and 3) reading grade
expectancy scores. Second, the paper will discuss the
history of the grade equivalent scores used in standardized
tests and in readability scores and will describe the
argument that a circular, or skyhook relationship exists
between these two scores and curricular material. Finally,
the paper will propose possible directions for test
constructors in the future.

Grade or age equivalent norms were one way of
departing from the reporting of raw score data and
attempting to cast an interpretation on test scores.
These norms were used to describe the average score
obtained by children taking a test (Anastasi, 1968,
p. 16; McLaughlin, 1960, p. 6). While the grade or age
equivalent scores for children above and below the mean
were usually determined by extrapolation and interpolation,
some newer tests have been normed on children one grade
above and one grade below the intended level (Robeck &
Wilson, 1974, p. 370).

In development of the scores, the age score was
first developed as a parallel to the mental age derived
from mental tests. W. A. McCall referred to Thorndike's
wish to continue his open-ended scale:

     But measurement continued to be a matter
     for experts because scale scores were difficult
     to compute and were generally incomprehensible.

     To overcome this difficulty the writer
     developed and popularized a plan for having all
     tests yield comparable and easily understood age
     scores such as reading age, arithmetic age,
     educational age, mental age, promotion age, and
     the quotients . . . .

     Later the writer invented the grade scale
     yielding G scores. These proved to be so popular
     that they came into almost immediate use on most
     tests from New York to Nanking.

McCall went on to boast that his "objectively-scorable
tests yielding age scores or G scores gave measurement to
the millions and provided the large profits" [emphasis mine]

(1939, P- i^5K^ • 

Grade and age scores lost their separate identity,
however, so that in later years reading experts have
proposed formulas ignoring any difference by adding or
subtracting 5.0 from whichever score was not in comparable
form (Della-Piana, 1968, p. 41). The grade equivalent
grew to be more preferred and was used in formulas for
calculating gains in reading ability (Bleismer, 1970),
for predicting and describing ranges of achievement
(MacGinitie, 1973), and in selecting candidates for
remedial reading classes (Harris, 1971).

The grade equivalent seemed easy to interpret, even
though test experts found areas for criticism of this
uneven interval, non-standard score measurement. Test
experts criticized the use of the grade equivalent score
not only because its range does not include actual scores
of real children (Spache, 1963, p. 360), but also because
its range does not describe how the content which is tested
is actually taught in schools. Anastasi pointed out that
all subjects are not given equal emphasis from grade to
grade and that an estimated "grade equivalent" may mask
the emphasis or neglect of that area in certain grade
levels (p. 61). Thorndike and Hagen suggested that a comparison
with a child's own age group may be more relevant, especially
if multiple sets of norming populations are used and
described (1960, p. 220).

The criticisms above have vacillated between
a desire for norming populations above and below the
grade level for which the test is intended and a desire
for a test which will measure only the curriculum taught
at that age level "ordinarily," whatever ordinarily may
mean, given the range of abilities in any classroom.

Thorndike, and later writers, felt that gradedness
attached to the passages used for reading, not just to
the child's obtained score. Greene & Jorgensen (1929,
pp. 14-15) and Monroe (1917, pp. 71-72) referred to the
extensive (for that time) experimental re-ordering of
the passages in response to children's performances with
the tasks.



Validity studies will be dealt with in the second
major section of this paper; however, one criterion test
should be mentioned in the first section. Part of the
oral history of reading instruction has been that a set
of reading materials, the McCall-Crabbs Standard Test
Lessons in Reading, were so good that tests were
standardized using the lessons as criterion. Joyce
Kamons, Test Editor for Teachers College Press (Columbia),
confirmed just the opposite: that the Test Lessons used
the Thorndike-McCall Reading Scale with its grade scores
as criterion and that grade scores were plotted and
assigned to the Lessons from the Scales. Ms. Kamons
stated that more complete information was not available
(1974, letter).

These Standard Test Lessons in Reading, with their
grade scores derived from comparison with the Thorndike
test, were used as a basis for the assignment of the
first readability scores and many others:

     The readability formulas developed by
     Lorge, Flesch, and Dale and Chall all used
     the McCall-Crabbs test lessons as a criterion.
     (Harris, 1974, p. 2)

Readability scores, in turn, were used to estimate
the difficulty of reading textbooks, reading tests,
library books, et cetera. Harris & Jacobson used passages
from many basal readers as criteria for their basic
revised scale, but also used correlation with the
McCall-Crabbs for additional validity (Revised
Formulas, p. 5). These writers indicated that the formulas
must be interpreted with a correction factor for realistic
use by children in reading.



Readability formulas were designed to help children
select material on a gradually increasing difficulty level
as their reading ability increases. Thorndike was
described as ordering his passages by difficulty from
empirical evidence, though modern writers have criticized
the extreme difficulty of his passages (Tuinman, 1971,
p. 197). Thorndike's test was used as criterion for the
McCall-Crabbs passages, which were used as criterion for
readability scales. Now Harris and Jacobson (Revised
Formulas) wish to re-standardize the McCall-Crabbs
passages, hoping that

     . . . the average comprehension scores of
     children on them will provide another and
     perhaps better criterion for validating
     or improving our readability formulas.

Reading grade expectancy scores involved the use of
grade or age scores from one or more tests other than
reading. Harris (1971, pp. 113-120) described a number
of such formulas. These formulas use calculations
involving such measures as chronological age, mental
age, reading age, arithmetic age, and years in school.
Grade scores were used rather than standard scores.
O'Connor (1972, pp. 78-79) cautioned against lumping
such scores and measures and ignoring the standard error
of measurement between the scores. It is the opinion of
this writer that such formulas may occasionally be justified
if their only use is to admit children to a remedial reading
class for instruction. If, however, such a formula is used
to measure attainment of a program goal, such as that of
the 1973 Virginia Standards of Quality, the formula appears
to be invalid. (Program goal, Standards of Quality:

     The average achievement level of the student
     population in reading and mathematics as
     measured by standardized tests will equal or
     exceed the average ability level of the student
     population as measured by scholastic aptitude
     tests.)

Conversion of the scores to standard scores would lend
more validity to the formulas, as would considering the
standard errors of measurement. (The quality of the test
for the purpose intended is deliberately ignored at this
point in the argument.)
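
The two adjustments named above can be illustrated with a brief
sketch. It is not drawn from any of the formulas cited in this
paper: the function names are hypothetical, the linear T score
(mean 50, standard deviation 10) is assumed as the standard score,
and the standard error of measurement is assumed to follow the
common textbook form, the standard deviation times the square root
of one minus the test's reliability.

```python
from math import sqrt
from statistics import mean, pstdev


def to_t_scores(raw_scores):
    """Convert raw scores to linear T scores (mean 50, SD 10).

    Assumption: a simple linear transformation within one norming
    group, not a normalized scale.
    """
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    if sd == 0:
        raise ValueError("scores have no spread; T scores are undefined")
    return [50 + 10 * (x - m) / sd for x in raw_scores]


def standard_error_of_measurement(sd, reliability):
    """Return SD * sqrt(1 - reliability), the assumed textbook form."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)
```

Under these assumptions, two measures expressed on the same T scale
could be compared directly, and a difference smaller than the
standard errors involved would be treated with caution rather than
read as a gap between achievement and ability.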

The basis for the reading grade expectancy score,
however, was that a child could be expected to read at
a level commensurate with his ability, usually
interpreted as mental ability. Some writers, including
Spache & Spache (1969, pp. 64-69), have questioned such
a one-to-one relationship between mental ability and
potential for reading ability. O'Connor (1972, pp. 78-79)
criticized the idea that mental age, in particular, could
be used as a measure toward which a child could aim in
reading.



Since the early twentieth century writers have
commented on the correlation between reading tests and
intelligence tests (Davis, 1972, p. 635) and Thorndike
boldly entitled his 1917 article: "Reading as Reasoning."
Intelligence tests have been used to predict potential
for reading ability for children measured by reading
tests which correlate highly with scores on intelligence
tests. Certainly, the relationship between reading and
intelligence has not been ascertained. The possible
relationships have implications for the total question of
validity, both for reading tests and for the readability
scores derived from them.



The APA Standards for educational and psychological
tests and manuals (1966) redefined the kinds of validity
from predictive, concurrent, construct, and content
validity to content validity, criterion-related validity,
and construct validity. A review of tests in reading
has been published at intervals since 1938. Inferences
about the validity considerations of early reading tests
have been drawn from O. K. Buros' Mental Measurements
Yearbooks (re-published in one volume with dates preserved
as Reading Tests and Reviews).

In the first Yearbook, 1938, statistical validation,
such as scores fitting the normal curve, was discussed.
Predictive, or criterion-related validity, was considered
when some tests were compared to the later college point
averages of students. The other type of criterion-related
validity was mentioned when tests were compared with other
tests. It was not until the second Yearbook in 1940 that
reading tests were criticized for not describing how items
were arrived at (pp. 151fl-1576, 1556) and for not describing
how the test in question distinguished between good and
poor readers (p. 1578). Considerations of construct and
content validity were secondary considerations, it would
appear.

The major criterion mentioned in all the yearbooks
was another test (or tests). One of the earliest tests
used as a criterion was the Thorndike-McCall Reading Scale.
Thorndike's original test had open-ended questions, rather
than the multiple-choice format that was chosen by his
colleague McCall for the Test Lessons that were used as
criterion for the readability scales (Thorndike, p. 425).
Once Thorndike's type of questioning (called reasoning
by himself and others) was converted into multiple-choice
format, his test and its descendants lent themselves to
being a ready source for what Ronald P. Carver (1973,
p. 52) described as "face validity, discriminate reliability
among individuals at any given level, and level-to-level
group increments." Carver's overall criticism was that
Thorndike's test began a circular relationship between
reading tests and tests of reasoning or intelligence.
The criticism of Davis (1972, p. 635) was repeated.

R. T. Lennon (1970, p. 123) took a generous view
toward the validity question for reading tests; he noted
that, when test makers cannot know or determine the
universe they are attempting to measure, they look for
internal consistency among other tests. Robeck & Wilson
pointed to improvements in recent standardized reading
tests (1974, pp. 367-371), but concluded that:

     The problem of validity plagues all of the
     test producers since there is no external
     criterion against which a reading test can be
     validated (p. 376).

Test constructor (or re-constructionist)
J. R. Bormuth (1970, p. 9) criticized test authors for
ignoring an external criterion. He pointed to the
circularity in validation, one reading test's being
correlated against another against yet another (as
described in the circle with the Thorndike test,
other reading tests, and even intelligence tests).
Bormuth reiterated Carver's accusation against test
writers, that face validity and ob . . .
in the items was more important to them than was the
question of whether the test does, in fact, measure
what it was intended to measure. Bormuth's solution
was for test writers to examine actual instruction in
the classrooms, then to apply certain transformational
generative grammar rules to develop test items; however,
his basic criticism, that test authors have ignored
the search for an external criterion, seems to be
directly related to this paper. His accusation of
circularity in validation has been echoed by many other
writers.

Davis (1972) conducted factor analyses for over
30 years, attempting to isolate skills and sequences of
skills in reading comprehension. He criticized test
authors for being economy-conscious in item selection
(cf. McCall, 1939, p. 9). Pyrczak (1972, p. 64) noted
that many students were able, far beyond chance levels,
to mark test items on reading tests without reading the
accompanying passages. Many items could be marked from
general knowledge and others could be marked because items
were interrelated (i.e., reading one question was a hint
to the answer of a second question). In "The Assessment
of Change," Davis suggested choosing the best among the
existing tests, using cautions from a knowledge of
possible error of measurement, and being willing to
depend upon tentative data (1970, pp. 326-339).

Carver (p. 53) described the ultimate in circularity
of validation: ETS's National Anchor Test Equating Study
in Reading, in which seven existing norm-referenced tests
in reading were to be made statistically more equal.
Combine this equation with the relationship between reading
tests and readability scales, and the child who has not
been able to achieve in current curricular materials will
be promised little. The readability scale used to devise
and measure his textbook is tied to his reading test, which
is tied to other reading tests; from his tests and the
readability scales come other textbooks, and sometimes
even his library books. (And his potential for reading
is estimated from intelligence tests to which, it might
be supposed, the Anchor Test Equating method might be
tied.)

Critics like Davis and Carver have alerted educators
to the problem of test circularity. Robeck & Wilson, along
with Buros, have described ways of devising better reading
tests which attempt to measure tasks which require reading
in addition to reasoning or intelligence tasks which can
be done without reading the passages on the tests.

Criterion-referenced test writers like Wayne Otto
(1974) have attempted to examine the curriculum, to
obtain teacher and developmental psychologist opinions,
and to validate skill sequences so that tests will tie
reading tasks discretely. Suggestions that "reading"
can be divided into bits and pieces have been objected to
by other writers, to the extent that Kenneth Goodman (1973)
suggested a reversion to the "inefficiency" of older
tests. Goodman implied that, if validity is desired,
reading may have to be evaluated in the natural setting,
perhaps even in a one-to-one situation like that used
in the informal reading inventory (pp. 30-33).

Kenneth Goodman's suggestion seems more persuasive
to this writer because criterion-referenced test writers
have surveyed the instruction in reading comprehension
and have not yet been able to agree upon what is being
taught nor about what is appropriate (learnable) for
each age/grade/ability level, except perhaps at the
lower levels of word attack skills. (Even the so-called
word attack skills have been tested in widely divergent
sequences and formats.) The research of Davis and his
cohorts has not yielded fruitful, isolatable skills and
skill sequences. Thorndike, in 1917, did realize that
he had to use reading passages, rather than isolated
items, that were progressively more difficult. This
strategy, of progressively more difficult passages,
is incorporated into the informal reading inventory
(Betts, 1946) that Kenneth Goodman mentioned. It was
mentioned earlier that Thorndike's original test used
open-ended questions; this question format presented
problems for group testing, quite obviously.

Whatever test is used, however carefully the
passages are arranged in order of difficulty, and however
closely the passages duplicate real-life reading situations,
test writers will wish to have a more realistic appraisal
of the appropriate reading tasks for each age/grade level,
able to be conducted as a group test and able to meet the
requirements for validity, reliability, and understandable
norming data.

Thus a scheme is needed for testing comprehension
in a group setting. Such a test would need to avoid the
guess factor; it would also need to avoid the inattention
factor that may result from a test's being too long.
Such a test might contain passages arranged, like Thorndike's
and his successors', in order of increasing difficulty.
The passages should be written reflecting the best research
on interest and degree of complexity leading to comprehension.
The passages could be divided into 15-20 minute "sittings,"
packaged as separate booklets, scored and normed separately.

A student could begin the test at the level he was
convinced would be very easy for him. The test score could
be calculated beginning with the student's highest basal
level (this might be defined as the level at which he
answers 7 of 8 consecutive questions correct, for example).
The test sittings could continue until the student met
a ceiling, based upon research of the best possible level
of error consistent with preventing the generalization of
frustration. (This level might be 5 errors in 8 answers,
more or less, but research might support allowing certain
students to continue above the ceiling, for further
diagnostic purposes.)
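
The basal and ceiling rule just described might be sketched as
follows. This is a minimal illustration, not a validated scoring
procedure: the function name, the representation of each sitting
as a level paired with a list of answers, and the grade-level
labels are all hypothetical, and the two thresholds are simply the
example figures given above (7 of 8 correct for a basal, 5 errors
in 8 answers for a ceiling).

```python
from typing import List, Optional, Tuple

BASAL_MIN_CORRECT = 7   # example figure from the text: 7 of 8 correct
CEILING_MIN_ERRORS = 5  # example figure from the text: 5 errors in 8


def find_basal_and_ceiling(
    sittings: List[Tuple[float, List[bool]]]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (highest basal level, ceiling level).

    Each sitting is (level, answers), ordered from easiest to
    hardest; an answer is True when correct. Sittings after the
    first ceiling are treated as not administered.
    """
    basal = None
    ceiling = None
    for level, answers in sittings:
        correct = sum(answers)
        errors = len(answers) - correct
        if errors >= CEILING_MIN_ERRORS:
            ceiling = level
            break  # testing stops once the ceiling is met
        if correct >= BASAL_MIN_CORRECT:
            basal = level  # keep the highest level meeting the basal rule
    return basal, ceiling


# Hypothetical record: four sittings of 8 questions each.
record = [
    (2.0, [True] * 8),
    (3.0, [True] * 7 + [False]),
    (4.0, [True] * 5 + [False] * 3),
    (5.0, [True] * 3 + [False] * 5),
]
print(find_basal_and_ceiling(record))
```

In this hypothetical record the student's basal falls at the second
sitting and the ceiling at the fourth, so the reading material
between those two levels is the body of material the scheme is
meant to isolate.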

The use of the ceiling and basal scheme would,
hopefully, eliminate the inattention factor at the
lower level (by removing the child from the task
between sittings as well as asking the child to start
at a level estimated to be appropriate to his independent
reading level). Often a good reader is penalized at the
lower end of the scale by being required to answer questions
which are far too easy and to which he does not adequately
attend. The basal scheme might also help eliminate the
guess factor at the higher level for the good reader who
may be fatigued by the time he gets to the end of the
test battery. It should also eliminate the guess factor
for the child who is a poorer reader who nevertheless is
encouraged to try to answer as many questions "as you
can" in the time limit of traditional tests. The
ceiling and basal scheme (well-established in intelligence
testing history, as well as in the history of the informal
reading inventory) should isolate a body of reading
material with which the child is comfortable; this scheme
should make the "grade level" designation more appropriate
to everyday reading tasks of the child. It would preserve
the efficiency (and Mr. McCall's profits) of the group
test format. To be less trivial, it might make group test
results more interpretable for the classroom teacher and
for the specialist wishing to use the results in a more
complete case study. The scheme should also be more
understandable to the most important person involved in
the testing, the child.